4 research outputs found

    A Novel Tree Structure for Pattern Matching in Biological Sequences

    Get PDF
    This dissertation proposes a novel tree structure, Error Tree (ET), to more efficiently solve the Approximate Pattern Matching problem, a fundamental problem in bioinformatics and information retrieval. The problem involves different matching measures such as the Hamming distance, edit distance, and wildcard matching. The input is usually a text of length n over a fixed alphabet of size Σ, a pattern P of length m, and an integer k. The output is those subsequences in the text that are at a distance ≤ k from P by Hamming distance, edit distance, or wildcard matching. An immediate application of the approximate pattern matching is the Planted Motif Search, an important problem in many biological applications such as finding promoters, enhancers, locus control regions, transcription factors, etc. The (l, d)-Planted Motif Search is defined as the following: Given n sequences over an alphabet of size Σ, each of length m, and two integers l and d, find a motif M of length l, where in each sequence there is at least an l-mer (substring of length l) at a Hamming distance of ≤ d from M. Based on the ET structure, our algorithm ET-Motif solves this problem efficiently in time and space. The thesis also discusses how the ET structure may add efficiency when it comes to Genome Assembly and DNA Sequence Compression. Current high-throughput sequencing technologies generate millions or billions of short reads (100-1000 bases) that are sequenced from a genome of millions or billions bases long. The De novo Genome Assembly problem is to assemble the original genome as long and accurate as possible. Although high quality assemblies can be obtained by assembling multiple paired-end libraries with both short and long insert sizes, the latter is costly to generate. Moreover, the recent GAGE-B study showed that a remarkably good assembly quality can be obtained for bacterial genomes by state-of-the-art assemblers run on a single short-insert library with a very high coverage. This thesis introduces a novel Hierarchical Genome Assembly (HGA) method that takes further advantage of such high coverage by independently assembling disjoint subsets of reads, combining assemblies of the subsets, and finally re-assembling the combined contigs along with the original reads. We empirically evaluate this methodology for eight leading assemblers using seven GAGE-B bacterial datasets consisting of 100bp Illumina HiSeq and 250bp Illumina MiSeq reads with coverage ranging from 100x-∼200x. The results show that HGA leads to a significant improvement in the quality of the assembly for all evaluated assemblers and datasets. Still, the problem involves a major step which is overlapping the ends of the reads together and allowing few mismatches (i.e. the approximate matching problem). This requires computing the overlaps between the ends of all-against-all reads. The computation of such overlaps when allowing mismatches is intensive. The ET structure may further speed up this step. Lastly, due to the significant amount of DNA data generated by the Next- Generation-Sequencing machines, there is an increasing need to compress such data to reduce the storage space and transmission time. The Huffman encoding that incorporates DNA sequence characteristics proves to better compress DNA data. Different implementations of Huffman trees, centering on the selection of frequent repeats, are introduced in this thesis. Experimental results demonstrate improvement on the compression ratios for five genomes with lengths ranging from 5Mbp to 50Mbp, compared with the use of a standard Huffman tree algorithm. Hence, the thesis suggests an improvement on all DNA sequence compression algorithms that employ the conventional Huffman encoding. Moreover, approximate repeats can be compressed and further improve the results by encoding the Hamming or edit distance between these repeats. However, computing such distances requires additional costs in both time and space. These costs can be reduced by using the ET structure

    Additional file 1 of HGA: de novo genome assembly method for bacterial genomes using high coverage short sequencing reads

    No full text
    Supplementary tables, experiments, and descriptions. The file contains a detailed results of the assembly of the genomes and assembler which were used in this research. Also, the results of some experiments that were performed to explain the method and the improvement showed in the manuscript. In addition, the file includes a descriptions of the datasets and the metrics which were considered in the manuscript. Lastly, the commands that were used by the different assemblers to run the assemblies. (DOCX 600 kb

    Does FinTech adoption increase the diffusion rate of digital financial inclusion? A study of the banking industry sector

    No full text
    Purpose: The purpose of this study is to discuss the United Arab Emirates’ (UAE) favorable attitude toward the financial sector’s digital transformation and the development of FinTech due to the rise of financial technology. FinTech blends innovation and technology to provide financial inclusion to stakeholders through various new products and services such metaverse and artificial intelligence. Design/methodology/approach: A quantitative research approach was used to empirically validate the suggested research model by using 260 Emirates-based banking authorities and administrators’ data. Findings: The findings indicate that FinTech adoption had a substantial impact on the competitiveness and performance of the UAE banking industry during COVID-19 times. The research indicates that adequate FinTech implementation and alignment with technology management directly influence the performance of the UAE’s banking sector in difficult times. Originality/value: This study is critical because the UAE banking sector serves diverse nationalities, and its success is contingent on FinTech and its competitive edge
    corecore